[serve][1/n] Introduce gang scheduling #60802
jeffreywang-anyscale wants to merge 7 commits into ray-project:master from
Conversation
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Code Review
This pull request introduces gang scheduling for Ray Serve, a significant feature for deploying distributed workloads. The changes are extensive, touching core components like the deployment scheduler and state management, adding new configuration options, and providing a solid suite of tests. My review focuses on a few areas for improvement: enhancing type safety in data classes, improving constant management for better code clarity, making cleanup logic more robust by adding logging, and simplifying some conditional logic. Overall, this is a well-implemented feature.
```python
except Exception:
    pass
```
Catching a broad Exception and passing silently can hide important issues during cleanup. While cleanup should be robust, it's better to at least log the exception to aid in debugging potential problems. For example, if there's a permission issue or a problem with the GCS connection, we would want to know about it.
```diff
-except Exception:
-    pass
+except Exception as e:
+    logger.warning(f"Failed to remove placement group {pg.id}: {e}")
```
```python
gang_placement_strategy=gang_config.gang_placement_strategy.value
if hasattr(gang_config.gang_placement_strategy, "value")
else str(gang_config.gang_placement_strategy),
```
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
537abee to 4b69ebd
/gemini review
Code Review
This pull request introduces gang scheduling for Ray Serve, a significant feature for deploying distributed workloads. The implementation is comprehensive, touching core components like the deployment scheduler and state manager. New dataclasses for gang context and requests are added, along with configuration options in DeploymentConfig and the @serve.deployment decorator. The logic for reserving placement groups for gangs and handling the RESTART_GANG failure policy seems well-thought-out. The changes are supported by a good set of unit and end-to-end tests. My review includes a few minor suggestions for improving robustness and code clarity.
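Because gang scheduling builds on Ray placement groups, the core "all-or-nothing" reservation can be illustrated with plain Ray APIs. The bundle shape and `STRICT_PACK` strategy below are assumptions for the sketch, not necessarily what the Serve scheduler in this PR chooses:

```python
# Sketch of atomic gang-sized reservation using Ray placement groups.
import ray
from ray.util.placement_group import placement_group

ray.init()

gang_size = 4
replica_resources = {"CPU": 1}

# One bundle per gang replica; Ray reserves all bundles atomically,
# so the gang either gets its full capacity or nothing.
pg = placement_group(
    bundles=[dict(replica_resources) for _ in range(gang_size)],
    strategy="STRICT_PACK",  # assumed; PACK/SPREAD variants also exist
)
ray.get(pg.ready())  # blocks until every bundle has been reserved
```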
```python
if num_replicas % v.gang_size != 0:
    raise ValueError(
        f"num_replicas ({num_replicas}) must be a multiple of "
        f"gang_size ({v.gang_size})."
    )
```
The `num_replicas` value could potentially be `None`, which would cause a `TypeError` when the modulo operator is used. While pydantic's default value handling might prevent this, adding a `num_replicas is not None` check would make this validator more robust against unexpected `None` values.
```diff
-if num_replicas % v.gang_size != 0:
-    raise ValueError(
-        f"num_replicas ({num_replicas}) must be a multiple of "
-        f"gang_size ({v.gang_size})."
-    )
+if num_replicas is not None and num_replicas % v.gang_size != 0:
+    raise ValueError(
+        f"num_replicas ({num_replicas}) must be a multiple of "
+        f"gang_size ({v.gang_size})."
+    )
```
| f"num_replicas_to_add {request.num_replicas_to_add} " | ||
| f"is not divisible by gang_size {gang_size}. " |
There's a trailing space in the f-string for the error message, which should be removed for cleaner output.
| f"num_replicas_to_add {request.num_replicas_to_add} " | |
| f"is not divisible by gang_size {gang_size}. " | |
| f"num_replicas_to_add {request.num_replicas_to_add} " | |
| f"is not divisible by gang_size {gang_size}." |
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
@cursor review
| f"gang_size ({v.gang_size})." | ||
| ) | ||
|
|
||
| return v |
Validator crashes when num_replicas is None
High Severity
The `validate_gang_scheduling_config` validator performs `num_replicas % v.gang_size` without checking if `num_replicas` is `None`. The `num_replicas` field is `Optional[NonNegativeInt]` and can be `None` when autoscaling is used (e.g. `num_replicas="auto"`). Additionally, in Pydantic v1, if `num_replicas` fails its own validation, it won't be present in `values`, causing `values.get("num_replicas")` to return `None`. In either case, `None % v.gang_size` raises a `TypeError`.
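A hedged sketch of how a defensive pydantic v1 field validator could look once the `None` guard is in place; the model and field subset below are illustrative only, and the real `DeploymentConfig` is much larger:

```python
from typing import Optional

from pydantic import BaseModel, NonNegativeInt, validator


class GangSchedulingConfig(BaseModel):
    gang_size: int
    gang_placement_strategy: str = "PACK"


class DeploymentConfigSketch(BaseModel):
    # Illustrative subset; the real DeploymentConfig has many more fields.
    num_replicas: Optional[NonNegativeInt] = 1
    gang_scheduling_config: Optional[GangSchedulingConfig] = None

    @validator("gang_scheduling_config")
    def validate_gang_scheduling_config(cls, v, values):
        if v is None:
            return v
        # values.get() yields None both when num_replicas is legitimately None
        # (e.g. autoscaling) and when num_replicas failed its own validation,
        # so skip the divisibility check in either case.
        num_replicas = values.get("num_replicas")
        if num_replicas is not None and num_replicas % v.gang_size != 0:
            raise ValueError(
                f"num_replicas ({num_replicas}) must be a multiple of "
                f"gang_size ({v.gang_size})."
            )
        return v
```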
```proto
int32 max_constructor_retry_count = 20;

// Gang scheduling configuration for atomic replica scheduling.
GangSchedulingConfig gang_scheduling_config = 21;
```
Proto file modified requires fault-tolerance review notice
Low Severity
This PR modifies src/ray/protobuf/serve.proto. Per the "RPC Fault Tolerance Standards Guide", changes to .proto files require a fault-tolerance review.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst
```python
# Gang PGs are shared across multiple replicas.
# Another replica in the same gang may have already
# removed this PG.
pass
```
Gang PG silently removed while siblings still running
Medium Severity
When a gang replica finishes stopping, `check_stopped` removes the shared placement group. Since gang PGs are shared, the first replica to stop removes the PG while sibling replicas may still be running on it. This is especially problematic during downscaling (which has no gang awareness) — individual replicas can be selected for removal, causing the shared PG to be deleted from under still-active gang members. The broad `except Exception: pass` also silently swallows unrelated errors during PG cleanup.
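One possible direction, sketched here purely for illustration (`GangContext`, its fields, and `on_replica_stopped` are hypothetical names, not this PR's code): only the last stopping member of a gang removes the shared placement group, and failures are logged rather than swallowed.

```python
import logging
from dataclasses import dataclass, field
from typing import Set

from ray.util.placement_group import PlacementGroup, remove_placement_group

logger = logging.getLogger(__name__)


@dataclass
class GangContext:
    # Hypothetical bookkeeping for one gang's shared placement group.
    placement_group: PlacementGroup
    live_replica_ids: Set[str] = field(default_factory=set)


def on_replica_stopped(gang: GangContext, replica_id: str) -> None:
    gang.live_replica_ids.discard(replica_id)
    if gang.live_replica_ids:
        # Sibling replicas still run on the shared PG; leave it in place.
        return
    try:
        remove_placement_group(gang.placement_group)
    except Exception as e:
        # Log instead of swallowing, per the review feedback above.
        logger.warning(
            f"Failed to remove gang placement group "
            f"{gang.placement_group.id}: {e}"
        )
```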
Ready for a first pass while I keep adding tests.


Description
This PR introduces gang scheduling support to Ray Serve, enabling atomic scheduling of replica groups. Gang scheduling ensures that sets of replicas are scheduled together -- either all succeed or all fail -- which is critical for distributed serving patterns requiring tight coordination between replicas or across multiple deployments. This is a stepping stone to achieve DP group fault tolerance in Ray Serve LLM.
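For concreteness, here is a hedged usage sketch based on the options visible in this PR's diffs (`gang_size`, `gang_placement_strategy`, and a `gang_scheduling_config` field on `DeploymentConfig`); the exact decorator argument names and accepted value shapes are assumptions, not the finalized API:

```python
# Illustrative sketch only; argument names and shapes are inferred from the
# diffs in this PR rather than a confirmed public API.
from ray import serve


@serve.deployment(
    num_replicas=8,  # must be a multiple of gang_size per the new validator
    gang_scheduling_config={
        "gang_size": 4,
        "gang_placement_strategy": "PACK",  # assumed strategy value
    },
)
class ShardedModel:
    def __call__(self, request) -> str:
        return "ok"


app = ShardedModel.bind()
```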
Key decisions
Out of scope for this PR
Test approach
Basic validation
- `@serve.deployment` succeeds and responds to requests
- `.options` succeeds and responds to requests
- `gang_size` replicas

Failure validation
Placement strategy
Failover / Fault Tolerance
Related issues
RFC: https://docs.google.com/document/d/1IzLTRJo-B8YF74eAKJA6XuDqZtzGOzEtJ1WjkMzroq4/edit?pli=1&tab=t.0
Additional information